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Abstract 

Background: Fungal plant pathogens cause serious agricultural losses worldwide. Alternaria arborescens is a major 
pathogen of tomato, with its virulence determined by the presence of a conditionally dispensable chromosome 
(CDC) carrying host-specific toxin genes. Genes encoding these toxins are well-studied, however the genomic 
content and organization of the CDC is not known. 

Results: To gain a richer understanding of the molecular determinants of virulence and the evolution of 
pathogenicity, we performed whole genome sequencing of A arborescens. Here we present the de-novo assembly 
of the CDC and its predicted gene content. Also presented is hybridization data validating the CDC assembly. 
Predicted genes were functionally annotated through BLAST. Gene ontology terms were assigned, and conserved 
domains were identified. Differences in nucleotide usage were found between CDC genes and those on the 
essential chromosome (EC), including GC3-content, codon usage bias, and repeat region load. Genes carrying PKS 
and NRPS domains were identified in clusters on the CDC and evidence supporting the origin of the CDC through 
horizontal transfer from an unrelated fungus was found. 

Conclusions: We provide evidence supporting the hypothesis that the CDC in A arborescens was acquired through 
horizontal transfer, likely from an unrelated fungus. We also identified several predicted CDC genes under positive 
selection that may serve as candidate virulence factors. 

Keywords: Alternaria arborescens, lllumina sequencing. Conditionally dispensable chromosome. Horizontal gene 
transfer, Polyketide synthase. Host specific toxins 



Background 

The rapid development of next-generation sequencing 
technologies over the past decade has led to a flood of 
both de-novo sequencing and re-sequencing projects in 
almost every branch of the tree of life. Within the fungal 
kingdom, comparative genome studies have led to the 
unexpected finding that large genomic regions may be 
variable among isolates of a given species. One category 
of these variable regions are unique chromosomes 
referred to as supernumerary or conditionally dispensable 
because they are not typically required for saprophytic 
growth [1-3]. These chromosomes have been identified 
in many fungi including Magnaporthe oryzae [4-6], 
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Fusarium oxysporum [7], Nectria haematococca [8,9], 
Mycosphaerella graminicola [10], Cochliobolus heterostro- 
phus [11], Leptosphaeria maculans [12], and Alternaria 
alternata [13,14]. 

Plant pathogenic fungi in the genus Alternaria infect a 
remarkable range of host plants and are major causes of 
agricultural yield losses [15]. Conditionally dispensable 
chromosomes (CDCs) are carried by several of the 
small-spored, plant-pathogenic Alternaria species 
[13,14,16]. These chromosomes are generally less than 
2.0MB in size, and may be transmitted horizontally 
between isolates in a population, potentially conferring 
new pathogenic attributes to the receiving isolate [17-20]. 
Loss of the CDC can also occur during repeated sub- 
culturing, resulting in the transition from a pathogenic to 
saprophytic form of the fungus [13]. Several genes coding 
host specific toxins (HSTs) have been located to gene 
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clusters on CDCs, including those producing AF-toxin 
from the strawberry pathotype [21], AK- toxin from the 
Japanese pear pathotype [22], and ACT- toxin from the 
tangerine pathotype [23]. These toxins share a common 
9,10-epoxy-8-hydroxy-9-methyl-decatrienoic acid struc- 
tural moiety, with the genes encoding each toxin sharing a 
high degree of homology [21-25]. In addition, the AMT 
gene from the apple pathotype, a gene involved in 
host-specific AM-toxin cyclic peptide biosynthesis, is 
located on a small chromosome of 1.1 to 1.7 Mb [13,26], 
with at least four copies involved in AM-toxin biosynthesis 
[27]. The only other gene sequences identified to date on 
CDCs are extended families of transposon-like sequences 
(TLSs) [14]. 

Horizontal gene transfer (HGT) is the movement, 
without recombination, of stable genetic material 
between two individuals [28]. HGT may not only occur 
between different individuals of the same species, but 
also between species or even between bacteria and fungi 
or between fungi and oomycetes [29,30]. In fungi, the 
movement of plasmids, mycoviruses, transposable 
elements, gene clusters, and whole chromosomes have 
been demonstrated from one individual to another [31]. 
The first theory to explain gain and loss of HSTs was 
proposed in 1983 [32]. It has then been hypothesized 
that the genome content of CDCs in Alternaria species 
were acquired through HGT events [14]. The most well 
studied example of HGT in fungi is the movement of the 
ToxA gene from the wheat blotch pathogen Stagonospora 
nodorum to Pyrenophora tritici-repentis, the causal agent 
of tan spot of wheat [33,34]. This horizontal transfer 
event was identified by nucleotide sequence similarity 
and structural comparisons between genes from both 
species. The direction of transfer was inferred by the fact 
that the ToxA gene consisted of a single haplotype in P. 
tritici-repentis but 11 haplotypes in S, nodorum isolates. 

Alternaria arborescens (synonym A. alternata f. sp. 
lycopersici), the fungus that produces host-specific AAL 
toxin, is the causal agent of stem canker of tomato 
[35,36]. It has been observed in pulsed field gel electro- 
phoresis (PFGE) studies that A, arborescens carries one 
CDC of 1.0-Mb [16,37]. To date, only two genes have 
been reported to be carried on this CDC including 
ALTl, which is a PKS gene involved in AAL toxin 
biosynthesis [38,39], and AaMSAS, also a PKS gene 
[40,41]. A CDC deletion mutant of A. arborescens gener- 
ated through restriction enzyme mediated integration 
(REMI) showed a toxin and pathogenicity minus 
phenotype [41]. In addition, in protoplast fusion experi- 
ments, a CDC from A, arborescens was observed to 
transfer into the strawberry pathotype, and subsequently 
introduced new tomato pathogenicity to the fusant [41]. 

In this study, we used a next generation sequencing 
approach to produce a draft sequence of the A, 



arborescens genome and used a novel bioinformatics 
approach to separate CDC contigs from the essential 
chromosome (EC) contigs. The gene content of the CDC 
was analyzed to answer the following questions: (1) What 
is the difference between the CDC and EC genome 
content at the nucleotide level? (2) Are CDC genes 
under positive selection and could they represent 
additional virulence factors in addition to the known 
toxin encoding genes? (3) Is the evolutionary history of 
the CDC the same as that of the ECs, and is there any 
evidence of a HGT event? In answering these questions, 
we confirmed a different genome content pattern of the 
A, arborescens CDC and found evidence for HGT. 

Results 

Sequencing & assembly 

A. arborescens strain EGS 39-128 (CBS 102605) [42] was 
sequenced by a whole genome shotgun approach using 
the Illumina Genome Analyzer II, which resulted in ~50 
million paired-end short reads of 75 bp representing 90X 
average coverage of the predicted genome content. 
De-novo assembly was performed using Velvet [43] 
(version 0.7), and confirmed by Edena [44] and Mini- 
mus2 [45]. The assembly resulted in 1,332 contigs with a 
N50 of 624KB and total size of 34.0MB (Additional file 
1: Table SI; Assembly has been deposited at DDBJ/ 
EMBL/GenBank under the accession AIICOOOOOOOO. The 
version described in this paper is the first version, 
AIICOIOOOOOO.) One hundred thirty-seven large 
contigs with lengths greater than 10KB and representing 
98% of the genome assembly content were chosen for 
further analysis. 

Marker-assisted identification of contigs carrying toxin 
biosynthetic genes 

The first challenge in analyzing the CDC was to isolate 
its assembly contigs away from EC contigs. For this 
genome, the process was made more challenging as there 
is no defined reference genome, few genetic markers, and 
no optical map. It is known from previous studies that 
most Alternaria species, including the isolate used in this 
study, have a single CDC [16,37]. To begin assembly of the 
this chromosome, two previously identified CDC genes 
that belong to the toxin biosynthetic cluster, ALTl and 
AaMSAS [38,40], were used as markers to search in all 
contigs. Through this strategy, two putative CDC contigs 
of 15 KB and 48 KB in length were identified as containing 
ALTl and AaMSAS, respectively. These two contigs were 
annotated to identify PKS genes, other toxin biosynthetic 
genes, as well as genes with orthologs in other fungi and 
bacteria (Figure 1). Multiple putative HST genes were 
identified on both contigs, consistent with predictions 
based on previous reports [14]. 
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AaMSAS_contJg. 48.9KB AaMSAS 




ID 


Description 


Identity 


E-value 


1 


Aiternana aitemata DNA, AM I genes region 


84% 


2.00b-38 


2 


Pyrenophora teres/, teres 0-1 hypothetical protein 


74% 


1 .OOE-141 


3 


C ochltoholus heterostrophtts PKSl 1 gene 


73% 


0 


4 


Alternaria aitemata ALTl gene 


99% 


0 


5 


Streptomyces aizimensis glycosyltransferase, PKS 


79% 


2.00E-13 


6 


Alternaria aitemata AFTl-1 gene 


79% 


3.00E-23 


7 


Leptosphaeria maculanslm SuperContig 1 v2 


97% 


4.00E-33 


8 


Neurospora crassa OR74A hypothetical protein 


73% 


2.00E-04 


9 


Alternaria aitemata ACTT2-2 gene 


74% 


l.OOE-13 


10 


Pyrenophora teres f. teres 0-1 hypothetical protein 


69% 


2.00E-50 


11 


Pyrenophora tritici-repentis Pt- 1 C-BFP 


11 Vo 


3.00E-91 


12 


Pyrenophora tritici-repentis Pt- 1 C-BFP 


67% 


9.00E-42 


13 


Alternaria aitemata DNA, AM T genes region 


71% 


l.OOE-13 


14 


Cochliobolns heterostrophiis PKS25 gene 


77% 


0 


15 


Alternaria aitemata AaMSAS gene 


100% 


0 


16 


Alternaria aitemata AF-toxin biosynthesis gene 


88% 


7.00E-05 


17 


Alternaria aitemata gene for Akt3-1 


79% 


3.00E-16 



Figure 1 Annotation of the initial two CDC contigs. The two lines represent two contigs containing CDC marker genes. Eacli box represents a 

BLAST annotation witli tine direction indicated by arrows, 
k ) 



Identification of the remaining CDC contigs and validation 
by Southern hybridization 

To identify additional CDC contigs, the Alternaria 
brassicicola {Ah) genome sequence was used as a refer- 
ence (downloaded from The Genome Institute at 
Washington University). A, brassicicola is a related 
species to A, arborescens {Aa) but does not carry CDCs. 
All contigs from Aa were aligned to Ab contigs using 
MUMmer [46] as the alignment tool with an identity 
cut-off at 90%. Eight previously identified marker genes 
from Aa, 6 from the EC and 2 from CDC, were used to 
set criteria to distinguish contigs belonging to CDCs 
versus ECs [41]. (Table 1) After comparing the align- 
ments of contigs containing the 8 marker genes, we 
set a CDC contig cut-off as those contigs with less 
than 20% coverage of sequence aligned to Ab with 
higher than 90% identity (including both coding and 
non-coding regions). Through this method, 29 



predicted CDC contigs were identified with total 
length of 1.0Mb, the same as the expected size from 
clamped homogenous electric fields (CHEF) gel ana- 
lysis [16,37] (Figure 2A). The remaining 108 contigs 
were considered essential chromosomes (ECs) with 
total genome size of 32.3MB (Additional file 2: Figure 
SI). Validation that the selected contigs belong to the 
CDC was performed by Southern hybridization using 
genes predicted to reside on the CDC and EC contigs 
respectively as probes (Figure 2C-F). Five probes were 
hybridized including 4 from the CDC and 1 from the 
EC. Each probe gene is predicted to be present in a 
single copy. The first probe was ALTl, a toxin gene 
known to reside on the CDC [41]. This was followed 
by hybridizing with three CDC genes predicted from 
the de-novo assembly, including a transporter 
CDC_92, a polysaccharide export protein CDC_102, 
and an o-methyltransferase CDC_147. The fifth probe 
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Table 1 Alignment of A, arborescens marker gene contigs and A. brassicicola contigs 



Contig ID 

N0DE_9 

NODE_136 

NODE_758 

N0DE_8 

NODE_58 

NODE_82 

NODE_274 

N0DE_151 



Length (bp) 



14729 

174048 

199480 

833831 

1201916 

1472031 

812677 



Alignment coverage 

13.6% 
15.7% 
26.1% 
44.1% 
61.9% 
69.9% 
72.1% 
73.5% 



Marker gene 

AoMSAS 

ALT! 

AKS17 

AKS21 

MAT! -2-1 

ALM 

VKS2 

AaTUB 



New ID 

CDC_contig_23 

CDC_contig_27 

EC_contig_009 

EC_contig_038 

EC_contig_063 

EC_contig_085 

EC_contig_091 

EC_contig_094 



GC% 

53.015% 
24.942% 
51.127% 
51.143% 
51.209% 
51.495% 
51.^ 
51.: 



The first two contigs contain two CDC marker genes and the following six contigs contain six EC marker genes. 



was EC_97_90_g721, a gene annotated as PKS and 
predicted to reside on the ECs. 

Gene prediction, length, GC3-content, and repeat 
identification 

Nine thousand, one hundred sixty-seven genes were 
predicted by FGENESH [47] using pre-trained Alternaria 
parameters, of which 209 genes were assigned to CDC 
contigs and 8958 to EC contigs. The average length of 
each predicted gene was 1.8 KB, and the gene density 
was 3.7KB per gene. Compared to gene predictions for 
A, brassicicola (average gene length = 1.3KB, gene 
density = 3.0KB per gene). A, arborescens genes were 
longer and present in lower density. To evaluate the 
origin of the CDC, the predicted genes residing on the 
CDC and EC contigs were compared at the nucleotide 
level, including gene length, GC3 -content [48], repeat 



load, and codon usage bias. This analysis showed that 
CDC genes are about 200bp shorter on average than EC 
genes (P = 2.36E-09) and have significantly lower GC3- 
content (P = 0.028). Repeat regions composed 5.3% of 
CDC contigs while only 0.6% of EC contigs (Additional 
file 1: Table S2). It should be noted that some repeat 
regions could be lost in short read sequences de-novo 
assemblies, however, even with possible suppressed 
numbers, this result indicates approximately lOX repeat 
enrichment in the CDC compared to the EC. 

Codon usage analysis 

Codon usage comparisons of CDC and EC genes were 
used to determine whether a bias in codon usage exists 
between the two groups. Both the Codon Adaptation 
Index [49] (CAI, P = 0, Figure 3) and Relative Synonymous 
Codon Usage (RSCU) [50] correlation (P=1.14E-14, 



5.7Mb- 



4.6Mb- 



3.5Mb- 



2.0Mb- 




- EC gene 
hybridization 



- CDC genes 
hybridization 



Figure 2 Southern hybridization to validate CDC contigs prediction. (A) Chromosomes of /4. arborescens separated by CHEF. Hybridization of 
(B) ALTl gene, three predicted CDC genes including (C) CDC_92, (D) CDC_102, (E) CDC_147, and (F) a predicted EC gene EC_97_90_g721 are 
shown. It should be noted that the membrane used for hybridization was prepared in 2004, so only a limited number of hybridization 
experiments were possible. 
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0.7 0.72 0.74 0.76 0.78 0.8 0.82 0.84 0.86 0.88 0.9 0.92 0.94 0.96 0.98 1 

Codon Adaptation Index (CAI) 

Figure 3 The Codon Adaptation Index (CAI) distribution of A arborescens CDC genes compared to EC genes. The CAI derived from the 
RSCU estimations was computed using "Automated Codon Usage Analysis Software" (ACUA). 



Additional file 1: Table S3) from the two groups were 
significantly different, suggesting a different origin. The 
largest codon usage bias was observed for the amino acids 
Tyrosine, Lysine, and Asparagine, with a preference for A 
over G and T over C in CDC genes (Additional file 1: Table 
S4). These three amino acids were not biased for CDCs 
in Fusarium [7], indicating there's no universal CDC codon 
usage bias pattern between Alternaria and Fusarium, 

Annotation of EC genes 

The assembly results showed the size of essential chro- 
mosomes region collectively to be 33.0 MB with 8958 
predicted genes. RepeatMasker identied only 0.12% of 
the EC region as simple repeats (about 50bp in length) 
and 0.08% as low complexity, indicating that short 
repeats may be lost during de novo assembly of Illumina 
sequencing reads. For secreted protein identification, 
1099 (12.2%) of the EC proteins were predicted to 
contain signal peptides, and were functionally annotated 
using BLAST to the NCBI database with more than 98% 
of the genes returning at least one hit with an E-value 
<1.0E-3. From the BLAST results, we identified 212 
transcription factors, 98 oxidase proteins, 202 kinase 
proteins, 279 transporters, 81 Cytochrome P450s, and 45 
different proteases. 

Annotation of CDC genes 

Several host-specific toxin genes and transposon-like 
sequences have been reported to be carried by CDCs in 
Alternaria [14]. We used two methods to annotate the 



functions of resident CDC genes: (1) they were blasted 
against the NCBI non-redundant database as well as 
Pfam [51] and NCBI CDD [52] to search for functional 
domains; (2) they were scanned to identify transcription 
factors, PKS genes, NRPS genes, P450s, transporters, and 
pathogenicity related genes. 

From 160 NCBI BLASTN hits of putative CDC 
genes (Figure 4), the top five species matches were 
Pyrenophora teres, Pyrenophora tritici-repentis, 
Phaeosphaeria nodorum, Leptosphaeria maculans, and 
Nectria haematococca, all of which are fungal phyto- 
pathogens. Interestingly, A. alternata was ranked 1^^ 
in this list, demonstrating that CDC genes were more 
similar to genes present in other fungal species rather than 
other Alternaria spp. Moreover, besides N, haematococca, 
all these other fungi are closely related taxonomically 
belonging to the class, Dothideomycetes. 

Gene ontology terms were assigned to CDC genes 
based on BLAST matches with sequences whose 
function was previously characterized [53]. Ninety CDC 
genes were assigned to a biological process, 51 for 
molecular function, and 15 for cellular component 
(Additional file 2: Figure S2). Among the biological 
process assignments, 54% of genes were assigned to 
"metabolic process", and 10% to "biosynthetic process". 
Enrichment of metabolic and biosynthetic process in 
CDC genes as compared to EC genes supported the 
observation that Alternaria CDC genes were enriched 
for polyketide synthases (PKS) and toxin synthases. 
Molecular function terms showed a significant 
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BLAST Top-Hits 

15 20 25 



Pyrenophora teres 
Pyrenophora tritici-repentis 
Phaeosphaeria nodorum 
Leptosphaeria maculans 
Nectria haematococca 
Verticillium albo-atrum 
Alternaria afternata 
Aiellomyces capsulatus 
Botryotinia fuckeliana 
Gibberella zeae 
Glomerella graminicola 
^Aspergillus davatgs 
Gibberella monilitormis 
w Aspergillus terreus 

■g Aspergillus oryzae 

0) Talaromyces stipitatus 

Q. Cochliobolusheterostrophus 
^ Penicillium marneffei 

Sclerotinia sclerotiorum 
Aspergillus niger 
Fusarium oxysporum 
Arthroderma gypseum 
Grosrnannia ciavigera 
Metarhi.zium anisopliae 
Uncinocarpus reesii 
Ustilago maydis 
Trichomonas, vaginalis 
Acaryochloris marina 
Agrobacterium vitis 
others 




Figure 4 BLAST taxonomy report of all CDC genes against NCBI "Nucleotide collection (nr/nt)" database. Species of BLAST hits 
(E-value < l.OE-3) were ranl<ed by tlieir occurrence. 



percentage (39%) to "nucleotide/nucleic acid binding", 
which showed an enrichment of transcription factors 
and gene regulation elements. 

To provide a more detailed characterization of putative 
CDC genes, each was translated to identify protein families. 
Among the 209 predicted CDC proteins, 31 were identified 
as carrying PKS domains. Two proteins were found to carry 
highly modular domains: KS-AT-KR-ACP on CDC_141 
and KS-AT-DH-ER-KR-ACP on CDC_165. The remaining 
29 PKS proteins each carried 1 or 2 ACPs (Acyl carrier 
protein) domains. Seven proteins were found to carry 
NRPS domains: 3 Enterobactin domains, 2 Bacitrancin 
domains, 1 Pyochelin domain, and 1 CDAl domain. Two 
proteins were identified as hybrid PKS-NRPS. Seven 
proteins were identified as P450 monooxygenase proteins. 
For transcription factors, 24 proteins were characterized to 
contain TF domains, in which Zn2Cys6 was the prominent 
group. Multiple ADP/ATP transporters, ABC transporters, 
ion transporters and major facilitator superfamily (MFS) 
transporters were also found in CDC protein group. 
Additionally, it was found that multiple proteins carrying 
FAD binding domains and oxidoreductases. Finally, 37 
proteins were identified as putative pathogenicity related 
genes through scanning CDC genes in the pathogen-host 
interactions database (PHI-base) [54] (E-value < 0.05). See 
Additional file 3 for a complete CDC gene annotation list. 

Secondary metabolite biosynthetic gene clusters 

In fungi, it has been reported that genes responsible for 
secondary metabolite biosynthesis (SMB) may be 



clustered [4,55]. Typically these include PKS or NRPS 
genes, as well as genes responsible for structural modifi- 
cations of initial metabolites, for transport, and for 
transcription regulation [56]. In this study, we screened 
each CDC gene and those surrounding them, looking for 
evidence of clustering of PKS, NRPS, transcription 
factors, transporters, P450 proteins, FAD binding 
proteins, transferases, and oxidoreductases. We identified 
10 putative SMB clusters (Figure 5). A typical SMB cluster 
is formed by 3-6 genes, with 1 or 2 PKS or NRPS genes, 
and other metabolite syntheses related genes. 

Evolutionary selection of CDC genes and domains 

To estimate selection on CDC and EC genes, Ka/Ks 
ratios were calculated, with the assumption that genes 
with Ka > Ks were likely under positive selection, genes 
with Ka = Ks were likely evolving neutrally, and genes 
with Ka < Ks were likely under purify^ing (negative) selec- 
tion. Twenty-eight CDC and 6,036 EC genes were 
successfully aligned to A. brassicicola genes and Ka and 
Ks values were calculated for each. It was observed CDC 
genes had about a double Ka (0.08/0.043) and larger Ka/ 
Ks ratios (0.133/0.084) than EC genes (Figure 6), possibly 
indicating greater positive selection on CDC genes. The 
two CDC genes with highest Ka/Ks ratio was CDC_102 
(PKSs) and CDC_146 (phosphotransferase). However, no 
CDC genes showed Ka/Ks > 1, suggesting that in these 
two Alternaria species strong positive selection may only 
occur in specific regions of a protein. The selection ratio 
only at conserved domains of CDC genes was then 
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PK8 



hydroxysteroid 
dehydrogenase 



NODE 136 



P450 



PKStypel 



Pyochelin 
(NRPS) 



PKStypell 



NODE 89. 



NODE 152 



PKStypel Unknown Unknown PKStypell PKS type II PKS type II D ihydroxy-acid Zn2Cys6TF 

NODE 9 ^^^H— ^^H— ^^H— ^^^H-- NODE_163-^^H— ^^^^ 



Enterobactin Phosphate 
(NRPS) Transferase MPS toxin 



NODE 309 < 



NODE 61 



NODE 100 • 



NODE_977 



Bacitracin 

Unknown (NRPS) PKStypell 



NODE 85 



PKStypell PKStypell duf521 domain n-gnat Mg2+ transporter Unknown Methyltransferase TFjumonjityoe 



Transcription C2H2 Pyridine 

)xiaore 



P450 Cariboxylesterase termination factor zinc finger PK^vp^l oxidoreducatase 



Integral membrane 

family protein Unknown PKS type II ABC transporter PKS type II 



Color code: 
PKS: Red 

Transcription factor: Yellow 
Transporter: Purple 



NRPS: Orange 
Transferase: Green 
Others: Black 



P450: Blue 

Oxidoreducatase: Gray 



Figure 5 Structure of predicted secondary metabolite biosynthesis (SMB) clusters. Each block represents a predicted CDC gene with the 
annotation listed above the block. Contig (NODE) numbers and color legend are given. Type I PKSs are multifunctional enzymes that are 
organized into modules, each harbors a set of distinct, non-iteratively acting activities responsible for the catalysis, while type II PKSs are 
multi-enzyme complexes carrying a single set of iteratively acting activities. 



estimated. Domains of aligned CDC genes were identi- 
fied using the NCBI CDD database. Each individual 
domain was extracted then the Ka/Ks ratio was calcu- 
lated and compared to that from same full length protein 
(Additional file 1: Table S5). We found two domains 
from CDC_151 with a higher Ka/Ks ratio compared to 
whole length protein: a 12x increase for the haloacid 
dehalogenase-like hydrolases (HADJike) domain, which 
uses a nucleophilic aspartate in their phosphoryl transfer 
reaction, and 2x increase for heavy- metal-associated 
(HMA) domain, which transports or detoxifies heavy 
metals. Another interesting example was CDC_144, whose 
domain patatin-like phospholipase (Patl7, belonging to 
the alpha-beta hydrolase family) showed a 1.5x increase 
compared to whole length protein. 

Origin of CDC 

In the taxonomy report of CDC BLAST results, top hits 
came primarily from closely related dothideomycete 
fungi, indicating CDC may have fungal origin, or a trans- 
fer event occurred between CDC content and one or 
more fungal genomes. To test whether CDCs have the 
same phylogenetic placement with ECs, a phylogenetic 



analysis was conducted, including EC and CDC genes 
from A. arborescens, genes from A, brassicicola, and 
from three other ascomycete species: P, tritici-repentis, 
L, maculans, and A, oryzae. Proteins coded by 6 genes 
showing homology in all 6 groups were used to build a 
distance tree (Figure 7) using the neighbor- joining 
method [57]. Results show the CDC clade was within 
but basal to the two Alternaria clades. 

A BLAST score ratio (BSR) [58] analysis was 
performed to test whether individual proteins on the A, 
arbor escens CDC had more similarity to A, brassicicola 
or other fungi, and the result was compared with the 
same analysis to EC proteins. Complete genome protein 
sequences of three fungal species: P, tritici-repentis, L 
maculans, and A, oryzae were extracted and built into a 
library called "3 -fungi", representing proteins from 
closely related fungal species. Then proteins from the 
CDC and ECs were compared to the A, brassicicola 
protein library and "3 -fungi" protein library respectively 
(Figure 8). It was clear, that there was less divergence 
between EC proteins and A, brassicicola proteins 
compared to that with other fungi (35.5% vs 15.1%), 
consistent with the species phylogeny. In contrast, CDC 
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Figure 6 Average Ka/Ks ratio, Ka, and Ks for all aligned EC genes and CDC genes. 6323 EC proteins and 34 CDC proteins were involved and 
Codem program in tine PAML suite was used (Capital "C", add "was"). 



proteins were more similar to proteins from other fungi 
(16.7% vs 10.0%), suggesting they have different 
evolutionary history other than EC proteins. 

Discussion 

Comparison to other CDC containing fungi 

Compared to other recently published assemblies of CDCs 
in filamentous fungi, A. arborescens has a relatively small 
number of CDCs (one) and the size (1.0Mb) is small. M, 
graminicola W2is reported to have the highest number of 
dispensable chromosomes with upwards of 8 ranging in 
size from 0.39 to 0.77MB [59]. Three CDCs in K 
haematococca [9,60], and 4 complete CDCs and partial 



region of another 2 in F, oxysporum [7] were identified. In 
other Alternaria species, identified CDCs are relatively 
larger such as 1.05Mb in the strawberry pathotype [14], 
1.1 to 1.7 Mb (depending on strains) in the apple patho- 
type [13], and 4.1 Mb in the Japanese pear pathotype 
[61,62]. In A. arborescens, only 1 dispensable chromosome 
is present, representing only 3% of the genome content, 
which is significantly smaller than other cases and may 
suggest a more recent acquisition or different origin. 

PKS and NRPS clusters 

Phytopathogenic fungi produce a diverse array of second- 
ary metabolites, including host-selective toxins conferring 
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Figure 7 Phylogenetic relationship of proteins in A. arborescens CDC shows discordance compared to EC proteins. The phylogenetic 
distance tree was constructed by "MEGA 5" using six protein sequences showing homology in all six groups. 
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Figure 8 Scatter plots of BLAST Score Ratio (BSR) of EC proteins (A) and CDC proteins (B). The numbers in yellow regions indicate the 
percentage of genes that either lack homologous sequences (lower left corner) or contain homologous sequences in both A brossicicolo {Ab) 
library and 3-fungi library (upper right corner). The numbers in pink regions (upper left) indicate the percentage of genes with homologous 
sequences in 3-fungi library but not in Ab library, while numbers in green regions (lower right) indicate the percentage of genes with 
homologous sequences in the Ab library but not in the 3-fungi library. 



pathogenicity [63]. It was reported in two basidiomycete 
maize pathogens candidate effector genes were located in 
small clusters that were dispersed throughout (both "are" 
change to "were") the genome [64]. However, in some 
other fungi, especially ascomycetes, genes coding for 
toxins can co-locate in clusters consisting of more than 10 
contiguous genes. A well-known example is the trichothe- 
cene biosynthetic gene cluster in F. graminearum which 
contains 10-12 genes including a terpene synthase 
gene, P450 monooxygenase genes, acyl transferase genes, 
regulatory genes, and transporter genes [55]. While in A, 
fumigatus, 26 SMB clusters were identified, each contain- 
ing 5-48 genes [65]. In our study, 29 PKS, 5 NRPS, and 2 
hybrid PKS -NRPS genes were found on the CDC, with lar- 
ger density compared to other fungi. However, among 10 
predicted SMB clusters, most were relatively small and 
only carried 3-8 genes, which may not represent the true 
cluster size due to short contigs length that may divide 
one large cluster into two or more. One example of an 
identified cluster was located on contig Node_309 which 
consists of 5 genes, including 2 PKSs, 1 NRPS putatively 
coding for enterobactin, a phosphate transferase gene, and 
a MFS transporter. It lacks regulators, P450s, and trans- 
porters compared to other typical clusters. However, this 
cluster locates at the edge of the contig. Only 5 genes away 



from this cluster, another small cluster containing PKS, 
NRPS, P450s and an ABC transporter was identified, 
suggesting these two could be part of a larger cluster (see 
Additional file 3: Supplementary CDC annotation list). In 
this study, PKS genes were identified by screening the PKS 
sequence database, especially the domain database, which 
include: ketoacyl synthase (KS), acyl transferase (AT), 
ketoreductase (KR), dehydratase (DH), enoyl reductase 
(ER), and acyl carrier protein (ACP, also known as PP 
domain). The KS, AT, and ACP domains are essential for 
PKS genes [66]. Two PKS genes were identified to have 
multiple domains above: KS-AT-KR-ACP in CDC_141, 
and KS-AT-DH-ER-KR-ACP in CDC_165. The remaining 
29 PKS genes each carries 1 or 2 ACP domains. Despite 
these conserved domains, other domains carried by 
these genes were divergent, indicating variance and 
multifunction of each PKS genes (Additional file 1: Table 
S6). However, at least 3 domain families were found to be 
enriched in the indentified PKS genes: ABC_membrane 
(4 identified), NADB_Rossmann (7 identified), and P-loop 
NTPase (6 identified), suggesting these proteins are trans- 
membrane and catalyzing enzymatic reactions. In the 
NRPS and hybrid PKS -NRPS gene group, enterobactin, 
bacitracin, pyoverdine, syringomycin, and CDAl domains 
were identified, 4 of which were reported from to bacteria 
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[67-70]. We eliminated the possibility of these genes 
originating from bacterial sequencing contamination by 
BLAST comparing all assembly contig sequences against 
the NCBI All Bacterial database with 2017 genome 
sequences, and found that the species with most hits was 
Streptomyces coelicolor with > 80% identity. However, only 
0.7% of the entire S, coelicolor genome was covered. 
Indicating that either these genes have an origin from 
bacteria or their product proteins interact with each other 
and require a highly conserved structure that was retained 
during evolution. 

Horizontal gene transfer 

According to the horizontal gene transfer hypothesis, 
A, arborescens may have acquired its CDC from an- 
other Alternaria species, from a fungus other than 
Alternaria, or possibly from a bacterium or virus [71]. 
There are at least two other possible explanations for 
its origin: (1) CDCs were present in an Alternaria ancestor, 
but were independently lost during vertical transmission in 
other non-pathogenic Alternaria species. (2) CDCs arose 
from essential chromosome as a copy first but then went 
under divergence so no obvious orthology could be 
detected. To test which of the three models fits this case 
best, we built a complete EC protein library and blasted all 
CDC proteins against it to detect any possible orthology. 
Out of 209 CDC proteins, we found 12 (5.7%) showing 
orthology to EC proteins. Although the low orthology 
percentage alone could not exclude the "duplication and 
divergence" model, taken together with differences on 
GC3-content and codon usage bias, the possibility that 
this model fits is minimal. 

To distinguish between HGT and vertical transmission 
hypothesis, we identified differences between A. 
arborescens CDC and EC genes in length, GC3-content, 
and codon usage bias. There was limited orthology 
detected between two groups; CDC genes showed 
discordant phylogenetic relation with EC, and had higher 
similarity to other fungi than A. brassicicola. From previ- 
ous phylogenetic analysis of 13 A. alternata isolates 
collected worldwide, CDC genes from different isolates 
were almost identical despite diverse EC background 
[41]. Taken these results together, we concluded that the 
HGT model may serve as the best fit model in this case. 
Additionally, these data support the theory proposed in 
1983 by Nishimura that Alternaria species acquired 
HSTsby HGT [32]. 

In this study, we identified evidence for the possibility 
of HGT event occurred in A. arborescens. For Alternaria, 
this strategy has its advantages. First, as a pathogen with 
a wide host range, as observed in nature, transportable 
pathogenicity chromosome may increase pathogen's 
adaptation to environment. Second, loss of a CDC 
when there's no host may reduce the cost of carrying 



extra genome content. Third, as asexual fungi, 
horizontal transfer may compensate the lack of genetic 
recombination. 

Conclusions 

In this study, we identified A. arborescens CDC 
sequences through a whole genome sequencing and 
de-novo assembly process. By comparing nucleotide 
usage between CDC and EC contigs, we found evidence 
supporting HGT in A. arborescens. We also identified 
some predicted CDC genes under positive selection that 
may serve as virulence factors. However, questions still 
remain, such as the similarity and difference among 
CDCs from different A. arborescens isolates. To better 
understand CDC characteristics and mechanisms of 
HGT, other Alternaria isolates need to be sequenced. 

Materials and methods 

Sequencing, assembly & alignment 

A. arborescens DNA was extracted following a protocol 
described [72] and the sequencing library was prepared 
using the Illumina Paired-End DNA Sample Prep Kit. 
Sequencing was performed using Illumina Genome 
Analyzer II. Short reads were assembled de-novo using 
Velvet, and assembly quality was improved by a pipeline 
including two alternate assemblers: Edena [44], and 
Minimus2 [45]. Parameters including k-mer length for 
Velvet and hash length for Edena were optimized by 
sequential step changes. The alignment between A. arbores- 
cens and Abrassicicola was conducted using the Nucmer 
program in the MUMmer suite [46], with parameter c = 15, 
1 = 10. Alignments with identities lower than 90% or lengths 
shorter than lOObp were removed. 

Southern hybridization 

On the CHEF gel membrane presented in Figure 2, lane 1 
contains size markers, lane 2 contains A. arborescens chro- 
mosomes that had degraded, and lane 3 contains intact A. 
arborescens chromosomes. Southern hybridization was 
conducted using the GE health CDP-Star kit with 5 gene 
probes, including 1 CDC marker gene ALTl, 3 predicted 
CDC genes, and 1 predicted EC gene. Primers (Additional 
file 1: Table S7) were designed using Primer3 [73] (vO.4.0). 
Blots were stripped between hybridizations to ensure no 
probes from previous hybridization remained. Film was 
exposed for 48 hours. 

Gene prediction, codon usage analysis & repetitive DNA 
identification 

Gene prediction was conducted using FGENESH [47], an 
ab initio gene predictor provided in the Softberry website. 
A pre-trained Alternaria matrix was used to optimize 
predictions. Both CDSs and protein sequences were gen- 
erated and converted into fasta format files. ACUA [74] 
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was used for calculating CAI and RSUC for each gene, 
and CAI distribution curves from the CDC group and EC 
group were compared to each other. Student s F-test was 
used to test statistical significance. RepeatScout [75] was 
used for de-novo identification of repeat sequences in 
both CDC and EC sequences. The repeat libraries were 
then aligned back to CDC and EC contigs using Nucmer 
to calculate the repeat percentage for each group. 

Gene annotation 

Blast2go [76] was used to annotate genes by "BLASTX" 
to the NCBI non-redundant protein database and then 
GO term assignment from the gene ontology database. 
Annotation of conserved domains was identified by scan- 
ning proteins through Pfam and NCBI CDD. PKS and 
NRPS genes were identified through scanning an online 
database SBSPKS [77]. The Fungal transcription factor 
database (FTFD) [78] was used to identify transcription 
factors. Transporters, P450s, and oxidoreducatases were 
identified based on BLAST and domain inspection. 
Potential secreted proteins were predicted using Signal 
3.0 [79]. Pathogenicity and virulence factors were identi- 
fied through scanning CDC genes in the pathogen-host 
interactions database (PHI-base) [54]. 

Estimating Ka/Ks Ratios 

A, arborescens proteins were blasted against A, brassici- 
cola proteins to generate a match list between the two 
groups with a bits score cut-off at 300. The gene 
sequences coding for aligned proteins were extracted by 
an in-house PERL script. Prank [80] was used to conduct 
codon alignment, in which two protein sequences were 
aligned first and then DNA sequences were aligned 
based on the corresponding protein alignments. The 
codon alignment result was then entered into "Codem" 
in PAML [81] (v4.0) for Ka and Ks calculation with 
model MO. In calculating, the Nei and Gojobori [82] 
method and Yang and Nielsen [83] method were used. 

Additional files 
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